I Made A Dataset So Dense It Broke My Hard Drive

I have a new dataset. It is called Dense-PRISM. It lives on Hugging Face. It is 164 GB. My hard drive cried when I uploaded it. My internet provider sent me a concerned email. I am proud.

Density is not about size. Density is about information per byte. Dense-PRISM has so much information per byte that bytes are now asking for raises.

The Numbers

Let us talk about scale. Because numbers are fun and also terrifying.

File Size: 164 GB
Top-K Per Token: 4096
Prompts: 799
Regrets: ∞

The math works like this. Four thousand ninety-six top tokens logged for every single generated token. Seven hundred ninety-nine prompts. Average response length times four thousand ninety-six times seven hundred ninety-nine equals total training signals.

# Dense-PRISM by the numbers
4096 (top_k) * ~200 (avg tokens) * 799 (prompts) = ~654 million data points
                    

Six hundred fifty-four million training signals. From seven hundred ninety-nine prompts. That is the power of density. That is the curse of density. My hard drive understands the curse personally.

What Is In The File

Each entry contains the standard conversation format. User asks. Assistant answers. Then comes the gold. For every token in the response, you get the top 4096 alternatives with their log probabilities.

# Example Dense-PRISM Entry (abbreviated)
{
  "messages": [
    {"role": "user", "content": "Explain quantum entanglement simply"},
    {"role": "assistant", "content": "Quantum entanglement is a phenomenon..."}
  ],
  "response_tokens": 187,
  "token_logprobs": [
    {
      "position": 0,
      "generated_token": "Quantum",
      "logprob": -3.12,
      "top_k": [
        {"token": "The", "logprob": -1.2},
        {"token": "In", "logprob": -1.8},
        {"token": "Quantum", "logprob": -3.12},
        ... (4093 more alternatives)
      ]
    }
  ]
}
                    

That ellipsis represents four thousand ninety-three more tokens. Multiply that by every token in every response. You get Dense-PRISM. You get a file that makes file explorers hesitate.
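Here is a small sketch of how you might read one position from an entry like that. It assumes the field names shown in the abbreviated example and assumes the logprobs are natural logarithms. The `describe_position` helper is made up for illustration and does not ship with the dataset.

```python
import math

# A cut-down entry in the shape of the abbreviated example.
# Real top_k lists are far longer than three items.
entry = {
    "token_logprobs": [
        {
            "position": 0,
            "generated_token": "Quantum",
            "logprob": -3.12,
            "top_k": [
                {"token": "The", "logprob": -1.2},
                {"token": "In", "logprob": -1.8},
                {"token": "Quantum", "logprob": -3.12},
            ],
        }
    ]
}


def describe_position(step):
    """Return (rank, probability) of the generated token at one position.

    rank is 1-based within top_k, or None if the token is not listed.
    Assumes logprobs are natural logs (an assumption, not confirmed by the post).
    """
    ranked = sorted(step["top_k"], key=lambda alt: alt["logprob"], reverse=True)
    tokens = [alt["token"] for alt in ranked]
    chosen = step["generated_token"]
    rank = tokens.index(chosen) + 1 if chosen in tokens else None
    return rank, math.exp(step["logprob"])


rank, prob = describe_position(entry["token_logprobs"][0])
print(f"rank {rank}, probability {prob:.3f}")  # rank 3, probability 0.044
```

In this example the generated token was only the third most likely choice. That is exactly the kind of detail plain text would hide.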

Why 4096

Why not 50? Why not 100? Why not a reasonable number that does not break storage systems? Because 4096 is a power of two. Because it feels technical. Because I wanted to see what would happen.

Also, 4096 tokens cover a meaningful slice of the vocabulary. They show the student model not just the top choices but the entire neighborhood of possibilities. They teach semantic distance through how quickly the probability falls off across that neighborhood.

A model trained on Dense-PRISM knows that "Why?" and "Hey! whats up?" live in different probability neighborhoods. It learns tone through math. It learns style through statistics.
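If you want to check how much a smaller k would actually lose, one option is to sum the probability mass the first k alternatives cover. This is a rough sketch, assuming natural-log logprobs and the `top_k` layout from the example entry. The `top_k_mass` helper is hypothetical, and the three alternatives below are only the ones visible in the abbreviated example.

```python
import math


def top_k_mass(top_k, k=None):
    """Probability mass covered by the k highest-logprob alternatives.

    With k=None, every listed alternative is counted.
    """
    ranked = sorted((alt["logprob"] for alt in top_k), reverse=True)
    if k is not None:
        ranked = ranked[:k]
    return sum(math.exp(lp) for lp in ranked)


# Only the three alternatives visible in the abbreviated entry.
visible = [
    {"token": "The", "logprob": -1.2},
    {"token": "In", "logprob": -1.8},
    {"token": "Quantum", "logprob": -3.12},
]

print(round(top_k_mass(visible), 3))       # mass of all listed alternatives
print(round(top_k_mass(visible, k=2), 3))  # mass of the two most likely ones
```

Run this over a few real positions at different values of k. If the mass stops growing long before 4096, you know how much you can trim.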

The Free Part

Yes, it is free. MIT license. Download it. Fork it. Train your tiny models on it. Make something smarter than my tiny models. Please. My GPU needs the competition.

I could have put it behind a paywall. I could have made it exclusive. I did not. Open source is the point. Sharing is the point. Watching other people build cool things with my weird datasets is the point.

Storage Considerations

164 GB is large. It will take time to download. It will take space to store. It will take patience to parse. This is the cost of density.
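To put that cost in perspective, here is a back-of-the-envelope estimate. It assumes decimal gigabytes and the rough average of 200 tokens per response used elsewhere in this post, so treat the results as ballpark figures and not as measurements.

```python
FILE_SIZE_BYTES = 164 * 10**9  # assumes decimal gigabytes
TOP_K = 4096
PROMPTS = 799
AVG_TOKENS = 200  # the post's rough average, not a measured value

data_points = TOP_K * AVG_TOKENS * PROMPTS
bytes_per_point = FILE_SIZE_BYTES / data_points
mb_per_prompt = FILE_SIZE_BYTES / PROMPTS / 10**6

print(f"{data_points:,} data points")
print(f"~{bytes_per_point:.0f} bytes per (token, logprob) pair")
print(f"~{mb_per_prompt:.0f} MB per prompt")
```

Under those assumptions, each prompt works out to roughly two hundred megabytes. Plan your disk and your patience accordingly.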

# Tips for working with Dense-PRISM
Use streaming loaders. Do not load the entire file into memory.
Filter by prompt type before training to reduce scope.
Consider subsampling top_k (for example, keeping only the highest-ranked alternatives) if full density is not needed.
Have a large hard drive. Seriously.
                    

I learned these tips the hard way. My RAM cried. My swap file screamed. My patience evaporated. You do not need to repeat my mistakes. Learn from my pain.
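Here is a minimal sketch of the first and third tips together. It assumes the data is stored as JSON Lines, one entry per line, which the post does not confirm. The `stream_entries` helper and the file name in the comment are hypothetical.

```python
import io
import json


def stream_entries(lines, keep_top=64):
    """Yield entries one at a time, trimming each top_k list to keep_top items.

    `lines` can be any iterable of JSON strings, such as an open file handle.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for step in entry.get("token_logprobs", []):
            ranked = sorted(step["top_k"], key=lambda alt: alt["logprob"], reverse=True)
            step["top_k"] = ranked[:keep_top]
        yield entry


# Stand-in for an open file handle. With a real file this would be
# `with open("dense_prism.jsonl") as f:` (hypothetical file name).
sample = io.StringIO(json.dumps({
    "messages": [{"role": "user", "content": "hi"}],
    "token_logprobs": [{
        "position": 0,
        "generated_token": "Hello",
        "logprob": -0.5,
        "top_k": [
            {"token": "Hey", "logprob": -1.5},
            {"token": "Hello", "logprob": -0.5},
            {"token": "Yo", "logprob": -4.0},
        ],
    }],
}) + "\n")

for entry in stream_entries(sample, keep_top=2):
    print([alt["token"] for alt in entry["token_logprobs"][0]["top_k"]])  # ['Hello', 'Hey']
```

Only one entry sits in memory at a time. A single entry can still be large, since it carries 4096 alternatives for every response token, so trim early.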

What This Teaches

Distillation on sampled text alone teaches what to say. Dense-PRISM teaches how to choose what to say. The student model sees the probability landscape the teacher saw at every token. It understands why certain tokens fit certain contexts. It learns the shape of appropriate responses.

A model trained on this knows that formal questions deserve formal answers. It knows that casual greetings invite casual responses. It knows the distance between tones. It learns through exposure to the full spectrum of possibility.
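One common way to turn logged logprobs into a training signal is to minimize the KL divergence between the teacher's distribution and the student's at each position. The sketch below is an assumption about how someone might train on this data. It is not my training code. It renormalizes over only the listed tokens, which ignores any probability mass outside top_k, and `student_logits` is a hypothetical stand-in for real model output.

```python
import math


def log_softmax(values):
    """Normalize log-scores so that their exponentials sum to 1."""
    peak = max(values)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_total for v in values]


def top_k_kl(teacher_top_k, student_logits):
    """KL(teacher || student) restricted to the tokens in teacher_top_k.

    teacher_top_k: list of {"token", "logprob"} dicts from one position.
    student_logits: dict mapping token -> raw student score (hypothetical input).
    """
    tokens = [alt["token"] for alt in teacher_top_k]
    teacher = log_softmax([alt["logprob"] for alt in teacher_top_k])
    student = log_softmax([student_logits[t] for t in tokens])
    return sum(math.exp(t) * (t - s) for t, s in zip(teacher, student))


teacher_top_k = [
    {"token": "The", "logprob": -1.2},
    {"token": "In", "logprob": -1.8},
    {"token": "Quantum", "logprob": -3.12},
]
matching = {"The": -1.2, "In": -1.8, "Quantum": -3.12}
uniform = {"The": 0.0, "In": 0.0, "Quantum": 0.0}

print(top_k_kl(teacher_top_k, matching))     # close to 0: same distribution
print(top_k_kl(teacher_top_k, uniform) > 0)  # True: the student disagrees
```

A student that matches the teacher scores near zero. A student that treats every option the same gets penalized. In a real training loop you would compute this with your framework's own loss functions over full tensors.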

The Math Again

Let us return to the formula because it is beautiful in a terrifying way.

# Total training signals (avg_tokens of ~200 is a rough estimate)
top_k = 4096
avg_tokens = 200
prompts = 799
total_signals = top_k * avg_tokens * prompts

# Example calculation:
print(f"{total_signals:,} signals")  # 4096 * 200 * 799 = 654,540,800

# That is six hundred fifty-four million
# training signals from seven hundred ninety-nine prompts
# This is why my hard drive filed a complaint

Each prompt becomes a universe of possibilities. Each token becomes a lesson in probability. Each logprob becomes a teacher. This is distillation at maximum density.

Who Should Use This

People training small models. People who want their models to understand nuance. People who have large hard drives. People who enjoy watching progress bars move very slowly.

If you are training a model under 1B parameters, Dense-PRISM can teach it to speak with more intention. If you are training a model under 100M parameters, it can teach it to choose words with more care. If you are training a model under 10M parameters, it might teach it to form coherent sentences. Progress is relative.

Final Thoughts

Dense-PRISM exists. It is 164 GB. It has 4096 top tokens per generated token. It has 799 prompts. It is free. It is dense. It is available now on Hugging Face.

Download it if you dare. Train on it if you can. Make something better than my confused tiny models. That is the goal. That is the dream. That is Dense-PRISM.